Skip to content

Add protobuf support for lambdas#22362

Open
gstvg wants to merge 8 commits into
apache:mainfrom
gstvg:lambda_protobuf_new
Open

Add protobuf support for lambdas#22362
gstvg wants to merge 8 commits into
apache:mainfrom
gstvg:lambda_protobuf_new

Conversation

@gstvg
Copy link
Copy Markdown
Contributor

@gstvg gstvg commented May 19, 2026

Which issue does this PR close?

Part of #21172

Rationale for this change

Protobuf support wasn't implemented in main lambda PR to not make it even bigger

What changes are included in this PR?

Protobuf encoding and decoding (~1000 LOC in generated files, ~210 impl, ~400 tests)

Are these changes tested?

Unit tests, similar to the existing ones for scalar functions

Are there any user-facing changes?

None

@github-actions github-actions Bot added the proto Related to proto crate label May 19, 2026
return Err(Error::General(
"Proto serialization error: Lambda not implemented".to_string(),
));
Expr::HigherOrderFunction(HigherOrderFunction { func, args }) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: instead of appending this, can we move this next to the rest of functions? (Scalar, Aggregate etc)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much easier to cross check the impls now 5688f3f thanks
(also did this to others similar matchs)

Comment on lines +434 to +436
HigherOrderUDFExprNode higher_order_udf_expr = 37;
Lambda lambda = 38;
LambdaVariable lambda_variable = 39;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit as well: Can we move this next to the other UDFs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't better being ordered by field id? so the next one who add a node can spot the next available field id right away

expr_id,
expr_type: Some(protobuf::physical_expr_node::ExprType::LambdaVariable(
PhysicalLambdaVariableExprNode {
index: var.index() as u32,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for now we don't use this so it is always 0 no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used now after lambda capture got merged. The variable below has index 1, for example:

01)ProjectionExec: expr=[array_transform(make_array(number@0), (v) -> CAST(v@1 AS Float64) + 3) as array_transform(make_array(t.number),(v) -> v + Float64(3))]

Comment on lines +586 to +590
protobuf::PhysicalHigherOrderUdfNode {
name: expr.name().to_string(),
args: serialize_physical_exprs(expr.args(), codec, proto_converter)?,
fun_definition: (!buf.is_empty()).then_some(buf),
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we encode the return_field instead of resolving it from the schema when decoding (like ScalarUDF)?
Not really sure what would be the exact disadvantage of keeping it at is, but wondering if it would be cleaner to make the return type explicit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the HigherOrderFunctionExpr::new, the only constructor that receives return_field as arg, from the lambda PR, due to possible type mismatch, so it could be discussed on it's own PR when needed. See #21679 (review) and

FYI, the return field might not match the function output after the new arguments, but you don't have the schema here so you cant check that and I see ScalarFunctionExpr have the same problem

from https://github.com/apache/datafusion/pull/21679/changes#r3109735988 (quoting directly since the github link anchor doesn't work well)

let serialize_name = extract_function_name(&expr);
let deserialize_name = extract_function_name(&deserialize);

assert_eq!(serialize_name, deserialize_name);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we assert directly on expr and deserialize instead? since the Hofs implement PartialEq I think it should work
assert_eq!(expr, deserialized_expr);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 9a3f30c thanks
This was based on test_expression_serialization_roundtrip which checks only names but asserting directly on expr is definitively better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants